Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(batch-exports): replace invalid unicode with ? #21174

Merged
merged 1 commit into from
Mar 27, 2024

Conversation

bretthoerner
Copy link
Contributor

Problem

orjson is very strict about Unicode (but doesn't seem to provide a way to do error replacement on its own). Users have hit this in prod.

Changes

Catch rare errors, do Unicode error replacement, then use orjson.dumps again.

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

Yes.

How did you test this code?

Unit.

Copy link
Contributor

@tomasfarias tomasfarias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for looking into this.

Copy link
Contributor

@tomasfarias tomasfarias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to do this for the JSONLines writer down below too.

return orjson.dumps(d, default=str)
try:
return orjson.dumps(d, default=str)
except (UnicodeEncodeError, TypeError):
Copy link
Contributor

@tomasfarias tomasfarias Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange that the exception raised doesn't match the docs:

If orjson.dumps() is given a str that does not contain valid UTF-8, orjson.JSONEncodeError is raised. If loads() receives invalid UTF-8, orjson.JSONDecodeError is raised.

However, I verified and it's indeed raising a TypeError (and obviously the unit test covers it). Wondering if we should be safe and catch orjson.JSONEncodeError just in case this is a bug that's fixed upstream at some point.

Copy link
Contributor

@tomasfarias tomasfarias Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've figured it out: orjson.JSONEncodeError is an alias of TypeError (see: https://github.com/ijl/orjson/blob/master/src/typeref.rs#L197, and this issue). So, really, orjson is raising a TypeError.

Copy link
Contributor

@tomasfarias tomasfarias Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that this:

try:
    return orjson.dumps(d, default=str)
except orjson.JSONEncodeError:
    ...

Would also work (since orjson.JSONEncodeError is TypeError).

Copy link
Contributor

@tomasfarias tomasfarias Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One (nitty) thing though: UnicodeEncodeError is not the exception being raised, but the __cause__, so I would consider dropping that from the except and keeping just TypeError or orjson.JSONEncodeError.

@tomasfarias
Copy link
Contributor

thought: We really need to stop doing JSON unless user explicitly requests JSON in blob storage exports...

@bretthoerner bretthoerner force-pushed the brett/batch-export-unicode branch from 19a9068 to 53a8418 Compare March 27, 2024 14:06
@bretthoerner bretthoerner merged commit 7cff949 into master Mar 27, 2024
129 checks passed
@bretthoerner bretthoerner deleted the brett/batch-export-unicode branch March 27, 2024 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants